Fix race condition: subscriber misses publisher-connected notification (No such feed) by vladopol · Pull Request #1254 · strukturag/nextcloud-spreed-signaling

vladopol · 2026-05-29T12:30:01Z

PR: Fix "No such feed" race condition on subscriber reconnect

Problem

When multiple clients reconnect simultaneously (e.g. after a proxy timeout or network
interruption), subscribers frequently fail to receive audio/video from publishers with
No such feed (N) error from Janus VideoRoom, leaving them unable to hear other
participants despite the call appearing active.

Root cause: lost wakeup in `async.Notifier`

The retry logic in sfu/janus/subscriber.go relies on newPublisherConnectedWaiter
which uses async.Notifier to wait until the publisher signals it is connected (via
webrtcup event → notifyPublisherConnected).

The race condition:

Time →
  Publisher: webrtcup fires → notifyPublisherConnected() → Notify("pub_X")
             [no waiters exist yet — notification is LOST]
  Subscriber: tries joinRoom → "No such feed"
              → creates waiter → waiter.Wait(ctx) blocks forever
              → context expires → subscriber gives up

async.Notifier.Notify() only wakes existing waiters; if none exist at the time of
notification, the event is permanently lost. Subscribers created after the publisher
connected never receive the signal and wait until context timeout.

Secondary issue: busy-loop after sticky fix

The original code creates the waiter once before the retry loop:

waiter, stop := p.mcu.newPublisherConnectedWaiter(...)
defer stop()

retry:
    // join attempt
    waiter.Wait(ctx)   // same object reused
    goto retry

If NewWaiter returns an already-closed channel (sticky behavior), subsequent
Wait() calls on the same waiter return immediately → tight busy-loop spamming
Janus with join requests until context expiry.

Fix

1. `async/notifier.go` — sticky notifications (30-second TTL)

Notifier now remembers the timestamp of the last Notify() call per key.
NewWaiter() checks: if the key was notified within the last 30 seconds, it returns
an already-closed channel so the caller retries immediately instead of waiting
indefinitely.

Reset() clears notifiedAt together with the waiter maps to prevent stale signals
after a publisher leaves and rejoins.

2. `sfu/janus/subscriber.go` — waiter created inside retry loop

Move newPublisherConnectedWaiter inside the JANUS_VIDEOROOM_ERROR_NO_SUCH_FEED
case, creating a fresh waiter on every retry iteration:

retry:
    // join attempt
    case janus.JANUS_VIDEOROOM_ERROR_NO_SUCH_FEED:
        waiter, stop := p.mcu.newPublisherConnectedWaiter(p.publisher, p.streamType)
        if err := waiter.Wait(ctx); err != nil {
            stop()
            ...
        }
        stop()
        goto retry

This eliminates the busy-loop: each iteration creates a new waiter. If the publisher
was already connected (sticky signal within TTL), the new waiter returns immediately
and the retry join succeeds. If the publisher is not yet connected, the waiter blocks
until the signal arrives.

Observed behaviour after fix

Production system (Nextcloud Talk 23.0.4, spreed-signaling 2.1.1, Janus 1.4.0):

Before fix: No such feed errors in clusters of 10–44 per reconnect event,
affecting 2–5 users per incident, occurring ~10 times per day during business hours
After fix: No such feed = 0 across multiple reconnect events

Testing

The race can be reproduced by:

Starting a multi-participant call
Forcing simultaneous WebSocket reconnect of all HPB clients
(e.g. restart the signaling server or expire proxy timeout)
Observing whether subscribers receive audio after reconnect

Without fix: some subscribers get No such feed and never receive audio.
With fix: all subscribers successfully subscribe after retry.

When subscribers reconnect simultaneously with publishers (e.g. after a proxy timeout), the publisher's webrtcup event fires before the subscriber's waiter is created. The notification is lost, the subscriber waits until context timeout, and the call participant cannot hear others. Two issues fixed: 1. async/notifier.go: add sticky notification with 30-second TTL. Notify() now records the notification time. NewWaiter() checks if the key was notified recently and returns an already-closed channel so the caller retries immediately instead of waiting indefinitely. 2. sfu/janus/subscriber.go: move waiter creation inside the retry loop. The original code created one waiter before the loop and reused it, causing a tight busy-loop when the waiter channel was pre-closed by the sticky notifier. Creating a fresh waiter per retry eliminates this. Observed in production: clusters of 10-44 "No such feed" errors per reconnect event affecting 2-5 users, occurring ~10x/day. After fix: 0.

vladopol · 2026-06-02T23:11:27Z

Production validation

Running this patch in production for 5 days on a deployment with ~30 concurrent users (Nextcloud Talk 23.0.5, HPB 2.1.1, Janus 1.4.0, Talk Desktop v2.1.x on Windows):

Metric	Before patch	After patch (5 days)
"No such feed" errors	10–44 per reconnect event	0
Reconnect events/day	~10	observed 4 — zero errors
Users hearing silence	2–5 per event	0

The reconnect storms that triggered the race were caused by simultaneous WebSocket drops (clients that connected at similar times hit a proxy timeout together). After the fix, multiple identical reconnect events produced zero "No such feed" errors.

The 30-second TTL in notifier.go comfortably covers the typical subscriber reconnect window (observed: 1–8 seconds after publisher reconnects in our environment).

A dedicated issue with full reproduction details has been opened: #1257

fancycode · 2026-06-03T12:10:34Z

Thanks, I will be traveling for the next two weeks and will take a look when I'm back.

fancycode · 2026-06-03T18:59:45Z

Added an alternative fix in #1263 that doesn't need to remember the signaled events using a TTL map but keeps the notification object alive while the publisher still exists. This makes it clearer to reason about how long events are kept alive, especially in case of clients stopping and starting to publish.

Included a test that makes sure the publisher receives and signals the connected event before the subscriber waits for it (which would trigger the race without the change).

Would be great if you could test this in your environment, too. Thanks!

vladopol mentioned this pull request Jun 2, 2026

Race condition: subscriber misses publisher-connected notification → "No such feed" on reconnect #1257

Open

fancycode mentioned this pull request Jun 3, 2026

Fix race between publisher connected / subscriber waiting. #1263

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix race condition: subscriber misses publisher-connected notification (No such feed)#1254

Fix race condition: subscriber misses publisher-connected notification (No such feed)#1254
vladopol wants to merge 1 commit into
strukturag:masterfrom
vladopol:fix/no-such-feed-race-condition

vladopol commented May 29, 2026

Uh oh!

vladopol commented Jun 2, 2026

Uh oh!

fancycode commented Jun 3, 2026

Uh oh!

fancycode commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

vladopol commented May 29, 2026

PR: Fix "No such feed" race condition on subscriber reconnect

Problem

Root cause: lost wakeup in async.Notifier

Secondary issue: busy-loop after sticky fix

Fix

1. async/notifier.go — sticky notifications (30-second TTL)

2. sfu/janus/subscriber.go — waiter created inside retry loop

Observed behaviour after fix

Testing

Uh oh!

vladopol commented Jun 2, 2026

Production validation

Uh oh!

fancycode commented Jun 3, 2026

Uh oh!

fancycode commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Root cause: lost wakeup in `async.Notifier`

1. `async/notifier.go` — sticky notifications (30-second TTL)

2. `sfu/janus/subscriber.go` — waiter created inside retry loop